Video watermark detection

ABSTRACT

A method and system for detecting watermarks in video images including preparing a signal, extracting and calculating property values, detecting bit values and decoding a payload, where the payload is a bit sequence generated and embedded by enforcing relationships between property values in a volume of video are described.

FIELD OF THE INVENTION

The present invention relates to watermarking of video content and in particular to embedding and detecting watermarks in digital cinema applications.

BACKGROUND OF THE INVENTION

Videos contain both a spatial and a temporal axis. Images (and similarly video frames) can be represented in the spatial domain or in a transform domain. In the spatial domain, also called the ‘baseband’ domain, images are represented as a grid of pixel values. The transform domain representation of a pixeled (i.e., discrete) image can be computed from a mathematical transformation of the spatial domain image. In general, this transformation is perfectly reversible, or at least reversible without significant loss of information. There are several transform domains, the most well-known being the FFT (Fast Fourier Transform), the DCT (Discrete Cosine Transform), which is used in the JPEG compression algorithm, and the DWT (Discrete Wavelet Transform), which is used in the JPEG2000 compression algorithm. One advantage of representing content in a transform domain is that the representation can generally be more compact than the baseband representation for a similar perceptual quality. Watermarking methods exist for embedding watermarks in the baseband as well as in a transform domain.

Video or video images lend themselves to various watermarking approaches. These approaches to video watermarking can be grouped into three categories, based on whether they select the spatial structure, the temporal structure, or the global three-dimensional structure of a video for watermarking.

Spatial video watermarking algorithms extend still image watermarking to video watermarking via frame-by-frame mark embedding with existing image watermarking algorithms. In the prior art, the frame-by-frame watermark is repeated in each frame on a certain interval, where the interval is arbitrary and can be a few frames up to the whole video. On the detector side, it is advantageous for the Peak Signal-to-Noise Ratio (PSNR) to have the same watermark pattern repeated on a number of consecutive frames. However, if every frame has the same watermark pattern, special care may have to be taken to avoid vulnerability to a possible frame collusion attack. On the other hand, if the watermark changes for every frame, it can be harder to detect, while inducing flickering artifacts and still being vulnerable to collusion attacks in stable areas of the video.

As an improvement, it is not necessary to watermark every frame. In the prior art, only automatically selected ‘key frames’ (and the few frames around the key frame) are watermarked. Key frames are stable frames found between two shot boundary frames, and can be reliably located again even after a change of frame rate. Watermarking only key frames not only reduces the stress on the fidelity constraint but may also result in more security and less computational intensity.

While spatial domain watermarks can benefit from still image watermarking techniques robust to geometric transformations, e.g. using a geometrically invariant watermark, or replicating the watermark in tiled patterns or using a template in the Fourier domain, the distortion is difficult to invert, notably due to the screen curvature and the geometric transformations that occur during a camcorder capture of a projected movie. Furthermore, these two approaches are not secure against signal processing attacks; for instance, a template in the Fourier domain can easily be removed. Therefore, spatial domain watermarks can be more easily and securely detected if the original content is used for registration. In the prior art, a semi-automated registration method is used that matches feature points in the original frame with feature points in the extracted frame. For projection on a flat screen, a minimum of four reference points must be matched for inverting the transformation. An operator manually selects at least four feature points from a set of pre-computed feature points. A two-level registration can be done entirely automatically: first in the temporal domain, then in the spatial domain. A database of frame signatures (also called fingerprints, soft hash or message digest) is accessed by the watermark detector to match an extracted key frame with the corresponding original frame. The latter is then used for automatic spatial registration of the test frame.

It should be noted, however, that the computations for the selection of key frames require upcoming frames, which are not available at the time of watermark embedding for a real time application. An alternative method would be to maintain a constant time delay between frame processing and playback.

Prior art temporal watermarking schemes only exploit the temporal axis to insert a watermark, by varying the global luminance in each frame. That makes the watermark inherently robust to geometrical distortions, as well as simplifying the watermark reading after a camcorder attack. The robustness of the watermark to temporal low-pass filtering (typically applied when de-flickering a camcorded video) can be improved with other methods known in the art. However, the watermark can be fragile to temporal de-synchronization (especially after frame editing). Synchronization, however, can also be recovered by matching key frames between the desynchronized and original video.

The two previous approaches (spatial or temporal watermarking) use either one or two of the three available dimensions for watermarking. The absence of watermark structure in one or two of the three available dimensions in a video results in a suboptimal use of the space available for a watermark. The method described in Bloom et al., U.S. Pat. No. 6,885,757 “Method and Apparatus for Providing an Asymmetric Watermark Carrier” makes complete use of the structure of a video. In their spread-spectrum method, the technique is apparently robust and secure but the detector must synchronize the test video with the original video prior to detection.

SUMMARY OF THE INVENTION

An aspect of the present invention involves pseudo-randomly inserting constraint-based relationships between or among property values of certain coefficients over consecutive frames or within a single frame. The relationships encode the watermark information.

‘Coefficients’ are denoted as the set of data elements, which contain the video, image or audio data. The term ‘content’ will be used as a generic term denoting any set of data elements. If the content is in the baseband domain, the coefficients will be denoted ‘baseband coefficients’. If the content is in the transform domain, the coefficients will be denoted as ‘transform coefficients’. For example, if an image, or each frame of a video, is represented in the spatial domain, the pixels are the image coefficients. If an image frame is represented in a transform domain, the values of the transformed image are the image coefficients.

The present invention in particular deals with DWT for JPEG2000 images in digital cinema applications. The DWT of a pixeled image is computed by the successive application of vertical and horizontal, low-pass and high-pass filters to the image pixels, where the resulting values are called ‘wavelet coefficients’. A wavelet is an oscillating waveform that persists for only one or a few cycles. At each iteration, the low-pass only filtered wavelet coefficients of the previous iteration are decimated, then go through a low-pass vertical filter and a high-pass vertical filter, and the results of this process are passed through a low-pass horizontal and a high-pass horizontal filter. The resulting set of coefficients is grouped in four ‘subbands’, namely the LL, LH, HL and HH subbands.

In other words, the LL, LH, HL and HH coefficients are the coefficients resulting from the successive application to the image of, respectively, low-pass vertical/low-pass horizontal filters, low-pass vertical/high-pass horizontal filters, high-pass vertical/low-pass horizontal filters, and high-pass vertical/high-pass horizontal filters.
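The subband decomposition described above can be sketched as follows. This is a minimal illustration only: it assumes the simple Haar averaging/differencing filter pair (JPEG2000 itself uses the 5/3 or 9/7 wavelet filters), assumes even frame dimensions at every level, and folds filtering and decimation into one step. The function names `haar_dwt2_level` and `haar_dwt2` are hypothetical.

```python
import numpy as np

def haar_dwt2_level(image):
    """One level of a 2D Haar DWT (assumed filter); returns (LL, LH, HL, HH)."""
    img = np.asarray(image, dtype=float)
    # Vertical filtering: combine pairs of rows (low-pass = mean, high-pass = half difference).
    lo_v = (img[0::2, :] + img[1::2, :]) / 2.0
    hi_v = (img[0::2, :] - img[1::2, :]) / 2.0
    # Horizontal filtering of each vertical result: combine pairs of columns.
    ll = (lo_v[:, 0::2] + lo_v[:, 1::2]) / 2.0  # low-pass vertical / low-pass horizontal
    lh = (lo_v[:, 0::2] - lo_v[:, 1::2]) / 2.0  # low-pass vertical / high-pass horizontal
    hl = (hi_v[:, 0::2] + hi_v[:, 1::2]) / 2.0  # high-pass vertical / low-pass horizontal
    hh = (hi_v[:, 0::2] - hi_v[:, 1::2]) / 2.0  # high-pass vertical / high-pass horizontal
    return ll, lh, hl, hh

def haar_dwt2(image, levels):
    """Iterate on the LL subband; returns the final LL and the per-level (LH, HL, HH)."""
    ll = np.asarray(image, dtype=float)
    details = []
    for _ in range(levels):
        ll, lh, hl, hh = haar_dwt2_level(ll)
        details.append((lh, hl, hh))
    return ll, details
```

For an 8x8 input and two levels, the final LL subband is 2x2, and each level contributes one LH, HL and HH subband.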

An image may have a number of channels (or components) that correspond to different native colors. If the image is in grayscale, then it has only one channel representing the luminance component. In general, the image is in color, in which case three channels are typically used to represent the different color components (though a different number of channels is sometimes used). The three channels may respectively represent the red, green and blue components, in which case the image is represented in the RGB color space; however, many other color spaces can be used. If the image has multiple channels, the DWT is generally computed separately on each color channel.

Each iteration corresponds to a certain ‘layer’ or ‘level’ of coefficients. The first layer of coefficients corresponds to the highest resolution level of the image, while the last layer corresponds to the lowest resolution level. FIG. 1 is a video representation in one component of a 5-level wavelet transform. Units 105-120 are frames of a video. Unit 125 indicates the LL subband coefficients at the lowest resolution. Unit 125a shows the coefficients at (f,c,l,b,x,y) with frame f=0, channel c=0, resolution level l=0, subband b=0, and positions x and y=0.

To best exploit the 3D structure of a video, the present invention uses both the temporal and spatial axes. As spatial registration is hard to achieve for movies after projection and capture, the present invention uses very low spatial frequencies or global properties of low spatial frequencies, which are less sensitive to geometric distortions for spatial registrations. Temporal frequencies are more easily recovered as most transforms occurring during attacks are time-linear.

In the present invention, the low-resolution wavelet coefficients of the video are directly watermarked. As the number of pixels in a frame is on the order of 1000 times larger than the number of the lowest resolution wavelet coefficients, the number of operations is potentially much smaller in the present invention.
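The "order of 1000" figure can be checked with a short calculation, assuming a dyadic transform in which every level halves both the width and the height, and the 5 levels shown in FIG. 1. The function name is illustrative only.

```python
def coefficient_reduction(levels):
    """Ratio of pixels to lowest-resolution LL coefficients after `levels`
    dyadic decomposition levels (each level keeps 1/4 of the samples)."""
    return 4 ** levels

print(coefficient_reduction(5))  # 1024
```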

A method and system for watermarking video images including generating a watermark and embedding the generated watermark into video images by enforcing relationships between property values of selected sets of coefficients with a volume of video are described. The watermarks are thereby adaptively embedded in the volume of video. A method and system for watermarking video images including selecting sets of coefficients and enforcing relationships between property values of selected sets of coefficients with a volume of video are also described. A method and system for watermarking video images including generating a payload, selecting sets of coefficients, modifying coefficients and embedding said watermark by enforcing relationships between property values of selected sets of coefficients with a volume of video are also described. The modified coefficients replace the selected sets of coefficients.

A method and system for detecting watermarks in video images including preparing a signal, extracting and calculating property values, detecting bit values and decoding a payload, where the payload is a bit sequence generated and embedded by enforcing relationships between property values in a volume of video are described. A method and system for detecting watermarks in video images including preparing a signal and decoding a payload, where the payload is a bit sequence generated and embedded by enforcing relationships between property values in a volume of video are also described. A method and system for detecting watermarks in a volume of video including preparing a signal, extracting and calculating property values and detecting bit values are also described.

While the present invention may be implemented in hardware, firmware, FPGAs, ASICs or the like, it is best implemented in software residing in a computer or processing device where the device may be a server, a mobile device or any equivalent thereof. The method is best implemented/performed by programming the steps and storing the program on computer readable media. In the event that the speed required for real-time processing requires hardware for one or more sequences of steps, a hardware solution for all or any part of the processes and methods described herein can be easily implemented with no loss of generality. The hardware solution can then be embedded into a computer or processing device, such as but without limitation a server or mobile device. In an example of implementation for real-time watermarking of JPEG2000 images for digital cinema applications, a JPEG2000 decoder in a digital cinema server or projector delivers the coefficients of the lowest resolution level of each frame to the watermarking embedding module. The embedding module modifies the received coefficients and returns them to the decoder for further decoding. The delivery, watermarking and return of coefficients are performed in real-time.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is best understood from the following detailed description when read in conjunction with the accompanying drawings. The drawings include the following figures briefly described below where like-numbers on the figures represent similar elements:

FIG. 1 is a video representation in one component of a 5-level wavelet transform.

FIG. 2 is a flowchart depicting the payload generation step of watermarking.

FIG. 3 is a flowchart depicting the coefficient selection step of watermarking.

FIG. 4 is a flowchart depicting the coefficient modification step of watermarking.

FIG. 5 shows a video frame at full resolution and a video frame reconstructed from coefficients at resolution level 5.

FIG. 6 is a block diagram of watermarking in a D-cinema server (Media Block).

FIG. 7 is a flowchart depicting video watermark detection.

FIG. 8 is a flowchart depicting signal preparation for video watermark detection.

FIG. 9 shows a cross-correlation function.

FIG. 10 is a flowchart depicting detection of bit values in video watermark detection.

FIG. 11 shows an accumulated signal.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

A number of applications require real-time watermark embedding, such as session-based watermark embedding for a Set-Top Box and for a Digital Cinema Server (also called Media Block) or Projector. While fairly obvious, it is worth mentioning that this renders it difficult to apply watermarking methods that, at a given time, exploit frames coming later in time. Offline pre-computations (for example of a watermark's location or strength) should preferably be avoided. There are several reasons for that, but the two most important ones are: potential security leaks (current generation watermarking algorithms are generally less secure if the attacker knows the full details of the embedding algorithm), and impracticality.

In most applications, a unit of digitally watermarked content generally undergoes some modification between the time it is embedded and the time it is detected. These modifications are named ‘attacks’ because they generally degrade the watermark and render its detection more difficult. If the attack is expected to occur naturally during the application, the attack is considered ‘non-intentional’. Examples of non-intentional attacks can be: (1) a watermarked image that is cropped, scaled, JPEG compressed, filtered etc. (2) a watermarked video that is converted to NTSC/PAL SECAM for viewing on a television display, MPEG or DIVX compressed, re-sampled etc. On the other hand, if the attack is deliberately done with the intention of removing the watermark or impairing its detection (i.e. the watermark is still in the content but cannot be retrieved by the detector), then the attack is ‘intentional’, and the party performing the attack is the ‘pirate’. Intentional attacks generally have the goal of maximizing the chance of making the watermark unreadable, while minimizing the perceptual damage to the content: examples of attacks can be small, imperceptible combinations of line removals/additions and/or local rotation/scaling applied to the content to make its synchronization with the detector very difficult (most watermark detectors are sensitive to de-synchronization). Tools exist on the internet for the above attack purposes, e.g. Stirmark (http://www.petitcolas.net/fabien/watermarking/stirmark/).

In the case of the so-called ‘camcorder attack’, which is performed by a person illegally capturing a movie during playback in a theater, the attack is considered unintentional, even if the party performs an illegal action. Indeed, the movie capture is not done with the intent of removing the watermark. However, after its capture, the person may run additional processes on the captured video to ensure that the watermark can no longer be detected in the content. These latter attacks are then considered intentional.

For example, a session-based watermark for digital cinema must survive the following attacks: resizing, letterboxing, aperture control, low-pass filtering and anti-aliasing, brick wall filtering, digital video noise reduction filtering, frame-swapping, compression, scaling, cropping, overwriting, the addition of noise and other transformations.

Camcorder attacks include the following attacks in sequential order: camcorder capture, de-interlacing, cropping, de-flickering and compression. Notably, camcorder capture introduces a significant spatial distortion. The present invention is focused on the camcorder attack because it is generally recognized that a watermark surviving the camcorder attack will survive most other non-intentional attacks, e.g. a screener copy, telecine, etc. However, it is important as well that the watermark survives other attacks. The frames of a video are generally interlaced for playing on NTSC or PAL SECAM compliant systems. De-interlacing does not really impact the detection performance, but is a standard process used by pirates to improve the captured video quality. A video of aspect ratio 2.39 is captured fully with approximately a 4:3 aspect ratio; the top and bottom areas of the video are roughly cropped. Captured videos typically exhibit a disturbing flicker, which is due to an aliasing effect in the time domain. The flicker corresponds to quick variation of luminance, which can be filtered out. De-flickering filters are often used by pirates to remove such flickering effects. Even if de-flickering filters are not used with the intention of erasing a watermark, they can be very damaging to the temporal structure of the watermark, because they strongly low-pass filter each frame. Finally, captured movies are compressed to fit the available distribution bandwidth/media/format, e.g. DIVX or other lossy video formats. For example, movies found on P2P networks often have a file size allowing for storing an entire 100 minute movie on a 700 Mbytes CD. This corresponds to an approximate total bit rate of 934 kbps, or about 800 kbps if 128 kbps are kept for the audio tracks.
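The bit-rate figures above can be reproduced with a short calculation. The sketch assumes 1 Mbyte = 10^6 bytes and 1 kbps = 1000 bits per second, which gives a value within 1 kbps of the approximate total quoted in the text; the function name is illustrative.

```python
def total_bitrate_kbps(size_mbytes, duration_minutes):
    """Average bit rate needed to fit `duration_minutes` of content in `size_mbytes`.
    Assumes decimal units: 1 Mbyte = 1,000,000 bytes, 1 kbps = 1000 bit/s."""
    total_bits = size_mbytes * 1_000_000 * 8
    return total_bits / (duration_minutes * 60) / 1000

total = total_bitrate_kbps(700, 100)  # roughly 933 kbps for a 700 Mbyte CD
video = total - 128                   # roughly 805 kbps left after a 128 kbps audio track
```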

This sequence of attacks corresponds to the most severe processes that would occur during the lifetime of a pirated video that can be found on a peer-to-peer (P2P) network. It also includes, explicitly or implicitly, most of the above-mentioned attacks that watermarks must survive. In addition to the camcorder attack, the watermarking method and apparatus of the present invention also survive frame-editing (removal and/or addition) attacks.

Watermarking detection systems are called ‘blind’ (or non-blind) if the detector does not need (does need) access to the original content. There are also so-called semi-blind systems that need access only to data derived from the original content. Some applications such as forensic tracking for session-based watermarks for digital cinema do not explicitly require a blind watermark solution and access to original content is possible as detection will typically be done offline. The present invention uses a blind detector but inserts synchronization bits in order to synchronize the content at the detector. Semi-blind detectors can also be used with the present invention. If a semi-blind detector is used, synchronization could eventually be performed using the data derived from the original content. In this case, the synchronization bits would not be necessary, and the size of the watermark, also called watermark chip, could be reduced.

In a specific example for digital cinema application, a minimum payload of 35 bits needs to be embedded in the content. This payload should contain a 16-bit timestamp. If a time stamp is generated every 15 minutes (four per hour), 24 hours per day and 366 days/year, and the stamp repeats annually, there are 35,136 time stamps needed, which can be represented with 16 bits. The other 19 bits can be used to represent a location or serial number for a total of about 524,000 possible locations/serial numbers.

In addition, all 35 bits are required to be detectable from a five minute segment. In other words, no more than 5 minutes of video should be required to extract the forensic mark. In one embodiment, the present invention uses a 64-bit watermark, and the watermark chip is repeated every 3:03 minutes. A video watermark chip embedded in 3:03 minutes of video at 24 frames per second with one embedded bit per frame has 4392 bits (183 seconds*24 frames per second=4392 frames=4392 bits at one bit per frame).
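The payload and chip-length arithmetic of this example can be verified as follows. The variable names are illustrative; the numbers are those given in the text.

```python
# Time stamps: one every 15 minutes, 24 hours a day, 366 days a year.
timestamps_per_year = 4 * 24 * 366                        # 35,136
timestamp_bits = (timestamps_per_year - 1).bit_length()   # bits needed to index them: 16

# Remaining payload bits identify a location or serial number.
payload_bits = 35
id_bits = payload_bits - timestamp_bits                   # 19
id_count = 2 ** id_bits                                   # 524,288 distinct identifiers

# Chip length: 3:03 minutes of video at 24 frames/s, one embedded bit per frame.
chip_seconds = 3 * 60 + 3                                 # 183 s, within the 5 minute limit
chip_bits = chip_seconds * 24                             # 4392
```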

The video watermarking method of the present invention is based on modifying the relationship between different properties of the content. Specifically, to encode bits of information, certain coefficients of an image/video are selected, assigned to different sets, and manipulated in a minimal way in order to introduce a relationship between the property values of the different sets. Sets of coefficients have different property values, which generally vary in different spatio-temporal regions of a video, or are modified after processing the content. In general, the present invention uses property values that vary in a monotonic way, for which attacks have a predictable impact, because it is easier to ensure a robust relationship in that case. Such properties will be denoted as ‘invariant’. While the present invention is best practiced using invariant properties, it is not so limited and can be practiced using properties that are not invariant. For example, the average luminance value of a frame is considered ‘invariant’ over time: it varies generally in a slow, monotonic way (except at boundary shots); furthermore, an attack such as contrast enhancement will generally respect the relative ordering of each frame's luminance value.

A video content is typically represented with multiple separate components (or channels) such as RGB (red/green/blue, widely used in computer graphics and color television), YIQ, YUV and YCrCb (used in broadcast and television). YCrCb consists of two major components: luminance (Y) and chrominance (CrCb, also known as UV). The amount of luminance or Y-component of a video content indicates its brightness. Chrominance (or chroma) describes the color portion of the video content, which includes the hue and saturation information. Hue indicates the color tint of an image. Saturation describes the condition where the output color is constant, regardless of changes in the input parameters. The chrominance components of YCrCb include the color-red (Cr) component and the color-blue (Cb) component of the color. The present invention considers a video content as multiple 3D volumes of coefficients with the size of W*H*N (where W, H are the width, height of a frame in the baseband domain or in a transform domain, respectively, and N is the number of frames of the video). Each 3D volume corresponds to one component representation of a video content. The watermark information is inserted by enforcing constraint-based relationships between certain property values of selected sets of coefficients within one or more volumes. However, as the human eye is much less sensitive to the overall intensity (luminance) changes than to color (chrominance) changes, a watermark is preferably embedded in the 3D video volume representing the luminance component of a video content. Another advantage of luminance is that it is more invariant to transformations of the video. Hereinafter, a 3D video volume represents the luminance component unless otherwise specified, although it can represent any component.

In the present invention, a set of coefficients can contain any number of coefficients (from one to W*H*N) taken from arbitrary locations in the content. Each coefficient has a value. Therefore different property values can be computed from a set of coefficients; some examples are given below. To insert the watermark information, a number of relationships can be enforced by varying the coefficient values in a number of sets of coefficients. A relationship is to be understood in a non-limiting way, as one or a set of conditions that one or more property values of one or more sets of coefficients must satisfy.

Various types of properties can be defined for each set of coefficients. Properties are calculated preferably in the baseband domain (such as brightness, contrast, luminance, edge, color histogram) or in a transform domain (energy in a frequency band). Some property values can be calculated equally in the baseband and transform domains, as is the case of luminance.

One suitable way to embed a bit of information is by selecting two sets of coefficients, and enforcing a pre-defined relationship between their property values. The relationship can be, for instance, that one property value of the first set of coefficients is greater than the corresponding property value of the second set of coefficients. However, it is noted that there are several variations in the ways to embed bits of information. One way to embed more than one bit of information in the two selected sets of coefficients is to enforce relationships between the values of more than one property of the two sets of coefficients.
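A minimal sketch of the two-set scheme, under several assumptions not taken from the text: the property is the mean coefficient value, bit 1 is mapped to "first mean greater than second mean" (bit 0 to the opposite), the relationship is enforced with a fixed `margin`, and the correction is split evenly between the two sets. The names `embed_bit` and `detect_bit` are hypothetical.

```python
import numpy as np

def embed_bit(set_a, set_b, bit, margin=1.0):
    """Enforce mean(a) - mean(b) >= margin for bit 1, or <= -margin for bit 0.
    Coefficients are left untouched if the relationship already holds."""
    a = np.asarray(set_a, dtype=float).copy()
    b = np.asarray(set_b, dtype=float).copy()
    sign = 1.0 if bit else -1.0
    gap = sign * (a.mean() - b.mean())   # how strongly the relationship currently holds
    if gap < margin:
        shift = (margin - gap) / 2.0     # split the change evenly between both sets
        a += sign * shift
        b -= sign * shift
    return a, b

def detect_bit(set_a, set_b):
    """Blind detection: only the ordering of the two property values is read."""
    return int(np.mean(set_a) > np.mean(set_b))
```

For two sets with means 10 and 12, embedding a 1 moves the means to 11.5 and 10.5, while embedding a 0 changes nothing because the required ordering is already present with sufficient margin.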

It is also possible to embed a bit of information by using only one set of coefficients, and enforcing a relationship of a property value of this set of coefficients. For instance, the property value can be set to be greater than a certain value, which may be predefined or adaptively computed from the content. It is also possible to embed two bits of information using one set of coefficients, by defining four exclusive intervals, and enforcing the condition that the property value lies in a certain interval. Other ways to embed more than one bit include using more than one property value, and enforcing a relationship for each of the property values.
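The interval-based variant can be sketched as follows. The assumptions here are illustrative and not from the text: the property is the mean of the set, the four exclusive intervals have width `step` and repeat periodically along the value axis (so that the required change stays small), and the mean is moved to the center of the nearest interval carrying the desired two-bit symbol. The function names are hypothetical.

```python
import numpy as np

def embed_symbol(coeffs, symbol, step=4.0):
    """Shift the set so that its mean lies in an interval labelled `symbol` (0..3).
    Interval number floor(value / step) carries label (number mod 4)."""
    c = np.asarray(coeffs, dtype=float)
    mean = c.mean()
    # Centers of intervals labelled `symbol` are (4k + symbol + 0.5) * step; pick the nearest k.
    k = np.round((mean / step - symbol - 0.5) / 4.0)
    target = (4.0 * k + symbol + 0.5) * step
    return c + (target - mean)

def detect_symbol(coeffs, step=4.0):
    """Read back which of the four interval labels the mean falls into."""
    return int(np.floor(np.mean(coeffs) / step)) % 4
```

With this layout the mean never moves by more than two interval widths, and a symbol in 0..3 corresponds to two bits of information.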

In general, the basic scheme can be generalized to an arbitrary number of sets of coefficients, an arbitrary number of property values and an arbitrary number of relationships to be enforced. While this can be advantageous to embed higher quantities of information, specific techniques such as linear programming may have to be used in order to ensure that the various relationships are enforced simultaneously with a minimal perceptual change. As noted above, it can be easier to enforce a relationship if invariant property values are used.

Many properties in a 3D video volume (and set of coefficients) are relatively invariant in a spatio-temporal way and/or before/after processing of the content. Examples of invariant properties include:

-   Coefficients (e.g. wavelet coefficients) in consecutive frames or different sub-bands of the same frame
-   Average luminance values in consecutive frames
-   Average texture feature value in consecutive frames
-   Average edge measure in consecutive frames
-   Average color or luminance histogram distribution in consecutive frames
-   Energy in a certain frequency range
-   Any of the above invariant properties in an area defined by extracted feature points

Watermarking algorithms generally operate with a secret ‘key’, which is known only to the embedder and detector. Using a secret key brings advantages similar to those in cryptographic systems: for instance, the details of the watermarking system can be, in general, known without compromising the security of the system, therefore algorithms can be disclosed for peer review and potential improvement. Furthermore, the secret of the watermarking system is held in a key, i.e. one can only embed and/or detect the watermark if the key is known. Keys can more easily be hidden and transmitted because of their compact size (typically 128 bits). A symmetric key is used to pseudo-randomize certain aspects of the algorithm. Typically, the key is used to encrypt the payload (e.g. using a standard cryptographic algorithm such as DES) after it has been encoded for error correction and detection, and expanded to fit the content. For the method of the present invention, the key can also be used to set the relationships, which will be inserted between the property values of two different sets of coefficients. Therefore, these relationships are considered to be ‘pre-defined’, as they are fixed for a given secret key. If there is more than one pre-defined relationship for embedding the watermark, the key can also be used to randomly select the precise relationship, for a given bit of information and given sets of coefficients.

The selected sets of coefficients generally correspond to ‘regions’, where a region is to be understood as a set of coefficients located in the same area of the content. While regions of coefficients may correspond to spatio-temporal regions of the content, as is the case of baseband coefficients and wavelet coefficients, it is not necessarily the case. For instance, the 3D Fourier transform coefficients of the content correspond to neither a spatial nor a temporal region, but they would correspond to a region of similar frequencies.

For example, a set of coefficients may correspond to a region, which can be made of all the coefficients in a certain spatial area for one frame. To encode a bit of information, two regions from two consecutive frames are selected and their corresponding coefficient values are modified to enforce a relationship between certain properties of these two regions. It is noted, as will be explained in further detail below, that it may not be necessary to modify the coefficient values if the desired relationship already exists.

For yet another example, with wavelet transform there are four wavelet coefficients (LL, LH, HL and HH) corresponding to the four subbands for each position and each component (channel) at each resolution level for each frame. A set of coefficients may just contain one coefficient in one of the four subbands. Assume that C1, C2, C3, C4 are the four coefficients located at the same position, channel and resolution level but in the four subbands, respectively. One method to embed a watermark is to enforce a relationship between C2 and C3, which correspond to the coefficients in the LH and HL subbands, respectively. One example of the relationship is that C2 is greater than C3. Another method to embed watermarks is to enforce relationships between C1-C4 in a frame and the corresponding coefficients in the consecutive frame. A variation on this principle is to insert a relationship for only one type of coefficient, where the coefficient must be greater than a pre-computed value. For instance, for all positions in a frame at a certain resolution level it is possible to enforce a constraint that the value of coefficient LL is greater than a pre-computed value. In the above examples, the property value is the value of a wavelet coefficient itself.

It is essential to be able to identify the same, or nearly the same sets of coefficients on the detection side as on the watermarking side. Otherwise, the wrong coefficients would be selected and the measured property value would be erroneous. Identifying the correct coefficients is generally not a problem if the content has been mildly processed before detection, in which case the location of the coefficients (whether in a spatial or transform domain) has not changed. However, if the processing changes the geometrical or temporal structure of the content, as is generally the case during a camcorder attack, the coefficients are likely to change location.

If there is a change in the temporal structure of the content, one can use either a non-blind or a semi-blind scheme to resynchronize the content. Different methods are available in the prior art for that purpose. If the detection must be done blindly (i.e. without access to any data derived from the original content), it is possible to insert synchronization bits with a predictable value in the content, which will be used by the detector for resynchronizing the content. Such a scheme will be described in further detail below.

Changes in the geometrical structure of the content occur, for example, after rotation, scaling and/or cropping of the content. To ensure robustness to such changes in the case where the original content, or some data derived from it, is available (e.g. a thumbnail or some characteristic information of the original content), synchronization/registration methods known in the prior art can be used. These methods restore the modified content by matching the locations in the modified content to the corresponding locations in the original content.

In the case of blind detection, one possibility is to use very low spatial frequencies. For a video frame or an image, one region of coefficients may correspond to a full video frame, a half or a quarter of the frame. In this case, most of the coefficients will be correctly selected (all coefficients, if the region corresponds to a full video frame), and the detection is generally robust even if some coefficients are assigned to the wrong set.

Another way to be inherently robust to a change in the geometrical structure is to use regions that actually contain only one coefficient, and to enforce a relationship between one coefficient in one frame and one coefficient at the corresponding position in the next frame. If the same relationship is enforced for all coefficients in the two frames, one can easily see that the detection is inherently robust to geometrical distortions. A related way to ensure robustness to a change in geometrical structure is to create relationships between the different wavelet coefficients at a given location in different subbands. For example, in a wavelet transform there are four coefficients corresponding to the four subbands (LL, LH, HL and HH) for each resolution level, each position and each component (channel). The same relationship between two coefficients may be enforced for all positions in a frame at a certain resolution level to embed a watermark bit, which strengthens the watermark robustness. On the detection side, the number of times that the relationship is observed is used as an indicator of which bit was embedded.

Yet another way to ensure robustness to changes in the geometrical structure is to use feature points that are invariant to changes in the geometrical structure. Here, invariant means that, when a certain algorithm is used to extract feature points of a video or image, the same points are found on the original and on the modified content. Different methods are known in the prior art for that purpose. Those feature points can be used to delimit the regions of coefficients in the baseband and/or transform domain. For example, three adjacent feature points delimit an internal region, which can correspond to a set of coefficients. Also, three adjacent feature points can be used to define sub-regions, with each sub-region corresponding to a set of coefficients.

Yet another way to be inherently robust to a change in the geometrical structure is to enforce the relationships between the value of a global property of all coefficients in one frame and the value of the same global property of all coefficients in a second frame. It is assumed that such a global property is invariant to the change in the geometrical structure. An example of such a global property is the average luminance value of one image frame.

A non-limiting exemplary algorithm that embeds bits by enforcing constraints between property values of two consecutive frames of a video is as follows:

For each frame which is a JPEG2000 compressed image in a sequence of frames F1, F2, . . . Fn of video:

-   a) Select a region, which consists of N coefficients at the resolution level L. The coefficients may belong to one or more subbands, such as LL, LH, HL and HH. The region can be of arbitrary but fixed shape (e.g. rectangular) or, as described above, can vary depending on the original image content, using for example feature points for additional stability of the region when facing geometric attacks.
-   b) Determine the relevant global property for the region. A global property may be an average luminance value, an average texture feature measure, an average edge measure, or an average histogram distribution of the region. P is the value of such a global property.

For embedding a bit sequence {b1, b2, . . . bm}:

-   a) If bi (1≦i≦m) is 0, modify F_(2*i) and F_(2*i+1) in a minimal way (only if necessary) such that P(F_(2*i+1))>P(F_(2*i)).
-   b) Else if bi (1≦i≦m) is 1, modify F_(2*i) and F_(2*i+1) in a minimal way (only if necessary) such that P(F_(2*i+1))<P(F_(2*i)).

This algorithm can be extended to embed multiple bits per frame, by inserting relationships between several property values of the two frames.

For watermark detection:

-   a) Synchronize the captured video in the temporal domain. This can be done using synchronization bits, a non-blind scheme or a semi-blind scheme.
-   b) Select a region which consists of N coefficients at the resolution level L. As in embedding, the region can be of fixed shape.
-   c) Calculate the relevant global property for the region. P′ is the value of the global property of the region.
-   d) A bit 0 is detected if P′(F_(2*i+1))>P′(F_(2*i)).
-   e) A bit 1 is detected if P′(F_(2*i+1))<P′(F_(2*i)).
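The embedding and detection steps above can be sketched as follows. This is only an illustration, assuming that each frame's region is given as a flat list of coefficient values, that the property P is the average value, and that a small `margin` (a hypothetical parameter, not part of the steps above) separates the two property values. The names `prop`, `embed_bit` and `detect_bit` are likewise illustrative.

```python
from statistics import mean

def prop(region):
    """Global property P, here assumed to be the average coefficient value."""
    return mean(region)

def embed_bit(f_a, f_b, bit, margin=1.0):
    """Return copies of (F_2i, F_2i+1) with P(F_2i+1) > P(F_2i) for bit 0
    and P(F_2i+1) < P(F_2i) for bit 1. Frames are left unchanged when the
    relationship already holds with the assumed margin."""
    sign = 1.0 if bit == 0 else -1.0
    gap = margin - sign * (prop(f_b) - prop(f_a))
    if gap <= 0:
        return list(f_a), list(f_b)  # relationship already in place
    shift = sign * gap / 2           # split the change over both frames
    return [v - shift for v in f_a], [v + shift for v in f_b]

def detect_bit(f_a, f_b):
    """Bit 0 if P'(F_2i+1) > P'(F_2i), otherwise bit 1."""
    return 0 if prop(f_b) > prop(f_a) else 1

frame_a = [10.0, 20.0, 30.0]
frame_b = [10.0, 20.0, 30.0]
print(detect_bit(*embed_bit(frame_a, frame_b, 0)))
print(detect_bit(*embed_bit(frame_a, frame_b, 1)))
```

Splitting the change evenly over the two frames is one design choice among several; the change could equally be applied to one frame only.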

Watermarking in the present invention is separated into three steps: payload generation, coefficient selection, and coefficient modification. The three steps are described in detail below as an exemplary embodiment of the present invention. It should be noted that a great deal of variation is possible for each of these steps, and the steps and the description are not intended to be limiting.

Referring now to FIG. 2, which is a flowchart depicting the payload generation step of watermarking, a secret key is retrieved or received in step 205. Information including a time stamp and a number identifying a location or serial number of a device is retrieved or received at step 210. The payload is generated at step 215. The payload for a digital cinema application is a minimum of 35 bits and in a preferred embodiment of the present invention is 64 bits. The payload is then encoded for error correction and detection, for example, using BCH coding at step 220. The encoded payload is optionally replicated at step 225. Optionally, synchronization bits are then generated based on the key at step 230. Synchronization bits are generated and used when using blind detection. They may also be generated and used when using semi-blind and non-blind detection schemes. If synchronization bits were generated, then they are assembled into a sequence at step 235. The sequence is inserted into the payload at step 240 and the entire payload is then encrypted at step 245.

Payload generation includes translating the concrete information to be embedded into a sequence of bits, which we call the “payload”. The payload to be embedded is then expanded through the addition of error correction and detection capabilities, synchronization sequences, encryption and potential repetitions depending on the available space. An exemplary sequence of operations for payload generation is:

-   1. Translate “information” to be embedded into an “original payload”. Transform information (timestamp, projector ID, etc.) into payload. An example was given above for creating a 35 bit payload for a digital cinema application. In an exemplary embodiment of the present invention, the payload has 64 bits. Compute the “encoded payload” from the original payload; the encoded payload includes error correction and detection capabilities. Various error correction codes/methods/schemes can be used, for example BCH coding. The BCH code (127,64) can correct up to 10 errors in the received bit stream (i.e. approximately 7.87% error correction rate). However, if the encoded payload is repeated a number of times, a greater number of errors can be corrected thanks to the redundancy. In an exemplary embodiment of the present invention, the 127-bit encoded payload is repeated 12 times, and it is possible to correct up to 30% errors in the individual bits embedded in each frame.
-   2. Depending on available space, replicate the encoded payload to obtain the “replicated encoded payload”. In the present invention, replicate each of the encoded bits twelve times for a total of 127 (BCH coding)*12=1524 bits.
-   3. Using a key, encrypt the replicated encoded payload to obtain the “encrypted payload”; the encrypted payload is typically the same size as the replicated encoded payload.
-   4. (Optionally, prior to encryption) Generate synchronization bits and insert them at various places in the replicated encoded payload; the resulting sequence is the video watermark payload. For example, compute a fixed synchronization sequence with 2868 bits. This sequence is split into one global synchronization unit of 996 bits (as the header of the watermark chip) and 12 local synchronization units of 156 bits (for the headers of each payload). In this example, a large number of bits are used as synchronization bits. While it is possible to reduce the amount of synchronization bits significantly if we were to use a non-blind method (wherein the original content is used for temporally synchronizing the test content) at the detector, the synchronization bits are still very useful for locally adjusting registration. In other words, synchronization bits do take space that could otherwise be used for additional redundancy of the information and thereby increase robustness to individual bit errors. However, synchronization bits increase the precision and quality of the extracted information, which results in fewer individual bit errors. The number of inserted synchronization bits is therefore set as the best compromise resulting in the smallest number of errors in the 127 encoded bits.
-   5. Assemble the watermark chip by concatenating the following bits in order:
    -   Global synchronization unit (996 bits)
    -   First 127 bits of encrypted payload, then first local synchronization unit (156 bits)
    -   Second 127 bits of encrypted payload, then second local synchronization unit (156 bits)
    -   . . .
    -   Last 127 bits of encrypted payload, then last local synchronization unit (156 bits)

The watermark chip (e.g., 4392 bits) is typically one to two orders of magnitude larger than the original payload (e.g., 64 bits). This allows recovery from the errors that occur during transmission on a noisy channel.
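The chip layout and its size can be checked with a short sketch. The bit values below are placeholders only; BCH encoding, encryption and the actual synchronization sequences are not implemented here, and the function name `assemble_chip` is illustrative.

```python
GLOBAL_SYNC_BITS = 996   # global synchronization unit
LOCAL_SYNC_BITS = 156    # each local synchronization unit
ENCODED_BITS = 127       # one BCH-encoded payload
REPEATS = 12             # number of payload repetitions

def assemble_chip(global_sync, payload_blocks, local_syncs):
    """Concatenate the global sync unit, then each 127-bit payload block
    followed by its local synchronization unit."""
    chip = list(global_sync)
    for block, sync in zip(payload_blocks, local_syncs):
        chip.extend(block)
        chip.extend(sync)
    return chip

# Placeholder bit values standing in for the real sequences.
global_sync = [0] * GLOBAL_SYNC_BITS
payload_blocks = [[1] * ENCODED_BITS for _ in range(REPEATS)]
local_syncs = [[0] * LOCAL_SYNC_BITS for _ in range(REPEATS)]

chip = assemble_chip(global_sync, payload_blocks, local_syncs)
print(len(chip))  # 996 + 12 * (127 + 156) = 4392
```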

Referring now to FIG. 3, which is a flowchart depicting the selection of coefficients for watermarking, the key is retrieved or received at step 305. The payload (encrypted, synchronized, replicated and encoded) is retrieved at step 310. The coefficients are then divided into disjoint sets based on the key at step 315. Based on the payload bit and the key, the constraint between property values is determined at step 320.

The selection of coefficients can occur in the baseband or in a transform domain. The coefficients in a transform domain are selected and grouped in two disjoint sets C1 and C2. A key is used to randomize the coefficient selection. A property value for each of the two sets, P(C1) and P(C2), is identified, such that it is generally invariant for C1 and C2. A variety of such properties can be identified, for example, average value (e.g. luminance), maximum value, and entropy.

The key and bit to be inserted are used to establish the relationship between the values of a property of C1 and C2, for instance P(C1)>P(C2). This is called constraint determination. For additional robustness, a positive value ‘r’ can be used such that P(C1)>P(C2)+r. The relationship may already be in place, in which case the coefficients need not be modified. In the worst case, P(C2) may be significantly larger than P(C1), for instance, if P(C2) is already greater than P(C1)+t where t is a pre-determined value or determined according to a perceptual model, in which case it is not worth changing the coefficients because it may introduce perceptual damage. In most cases though, P(C1) will become P′1=P(C1)+p1, and P(C2) will become P′2=P(C2)−p2 (p1 and p2 are positive numbers), such that P′1>P′2+r.

Referring now to FIG. 4, which is a flowchart depicting the coefficient modification step of watermarking, at step 405, the disjoint sets of coefficients are received or retrieved. The property values for the disjoint sets of coefficients are measured at step 410. The property values are tested at step 415 to determine the distance between them, which is a measure of the robustness. If the property values are within a threshold distance, r, then proceed to step 420 because no coefficient modification is necessary. If the property values are greater than the threshold distance, r, then a further test is performed at step 425 to determine if the property values are within certain maximum distances allowed in order to perform coefficient modification. If the property values are within the maximum distances then the coefficients are modified to satisfy the constraint relationship at step 435. If the property values are not within the maximum distances then the coefficients are not modified, as prescribed by step 430.
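The three possible outcomes of constraint determination can be sketched as a small decision function. This is an interpretation of the text rather than the flowchart itself: it assumes the target constraint P(C1)>P(C2)+r and treats t as the maximum tolerated excess of P(C2) over P(C1); the function name `decide` and the returned labels are illustrative.

```python
def decide(p1, p2, r, t):
    """Classify a pair of property values for the constraint P(C1) > P(C2) + r.

    p1, p2: measured property values P(C1) and P(C2)
    r:      robustness margin
    t:      largest excess of P(C2) over P(C1) still worth correcting
    """
    if p1 > p2 + r:
        return "keep"    # relationship already in place, nothing to modify
    if p2 > p1 + t:
        return "skip"    # enforcing it risks perceptual damage
    return "modify"      # adjust coefficients to satisfy the constraint

print(decide(10, 5, 2, 20))
print(decide(5, 10, 2, 20))
print(decide(5, 40, 2, 20))
```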

The watermarking method of the present invention is “adaptive” to the original content, because the modifications to the content are minimal while ensuring that the bit value will be correctly detected. Spread spectrum watermarking methods can also be adaptive to the original content, but in a different way. Spread spectrum watermarking methods take account of the original content to modulate the change such that it does not lead to perceptual damage. This is conceptually different from the method of the present invention, which may decide not to insert any change at all in certain areas of the content, not because such modifications would be perceptible, but because the desired relationship already exists or because the desired relationship cannot be set without significantly deteriorating the content. As will be seen below, the method of the present invention can, however, be made adaptive both to ensure that the bit will be correctly decoded and to minimize the perceptual damage.

Because the method of the present invention introduces a minimal amount of distortion to ensure that a bit is robustly embedded, and gives up in cases where the distortion would be too severe, it would lead to a greater robustness than the spread spectrum methods for the same distortion and bit rate.

In the baseband domain, one embodiment of the present invention divides the pixels in each frame into a top part and a lower part. The luminance of the top/lower part is increased or decreased depending on the bit to be embedded. Each frame is split into four rectangles in the spatial domain from the center point. Splitting the frame into four rectangles allows storage of up to four bits per frame. The method includes:

-   Grouping pixel values into the top part of a frame and the lower part of a frame, to form two sets of coefficients C1 and C2.
-   Measuring the luminance, i.e. P(C1) is the average of all coefficients in C1, and likewise for C2.
-   Modifying the pixel values only if required, and in a minimal way, to set the constraint, e.g. P(C1)>P(C2)+r, where r is generally a positive value.

In this embodiment of the present invention, the watermark embedding module only has access to the lowest resolution coefficients of the wavelet transformation of the image. For video frames of size 2048 (width)×856 (height) pixels, there are 64×28=1728 coefficients for each subband at resolution level 5 (i.e. LL, LH, HL and HH), or 1728*4=6912. Only these coefficients, or a subset of these coefficients, are used for video watermark embedding. Two non-limiting methods are described below using groups of coefficients selected within a frame.

In the first method, only the LL coefficients (also called approximation coefficients) are used for video watermark embedding. The LL coefficient matrix (64×28) is split from the center point into four tiles/parts C1, C2, C3 and C4 of 32×14 each. Depending on the bit to be embedded and the key, a certain relationship is created between the coefficients of each of the four parts LLa (top left region), LLb (top right), LLc (bottom right) and LLd (bottom left) by increasing/decreasing coefficients of each part such that a certain constraint is met. Each of the four rectangular tiles/parts can have between 286 and 1728 coefficients for each of the three color channels. To smooth the watermark (and limit its visibility) at the transition between regions LLa to LLd, a transition region can be left non-watermarked or watermarked with a lowered strength.

An example of a constraint can be: P(C1)+P(C2)>P(C3)+P(C4). While it is noted that for a linear property such as average luminance, this equation can be written as P(C1 union C2)>P(C3 union C4), where there are only two regions instead of four, this is generally not true for a non-linear property such as the maximum value of all coefficients. There are several different possible constraints depending on the bit to be embedded and the key used.
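The difference between a linear and a non-linear property can be illustrated with a small numeric sketch. The tile values are made up for illustration, the tiles are assumed to be of equal size (which the equivalence for the average relies on), and the helper names `holds_four` and `holds_union` are hypothetical.

```python
from statistics import mean

def holds_four(prop, c1, c2, c3, c4):
    """Four-region form: P(C1) + P(C2) > P(C3) + P(C4)."""
    return prop(c1) + prop(c2) > prop(c3) + prop(c4)

def holds_union(prop, c1, c2, c3, c4):
    """Two-region form: P(C1 union C2) > P(C3 union C4)."""
    return prop(c1 + c2) > prop(c3 + c4)

# Illustrative tiles of equal size.
c1, c2, c3, c4 = [1, 5], [1, 5], [6, 0], [0, 3]

# Average (linear): both forms agree for equal-sized tiles.
print(holds_four(mean, c1, c2, c3, c4), holds_union(mean, c1, c2, c3, c4))

# Maximum (non-linear): the two forms can disagree.
print(holds_four(max, c1, c2, c3, c4), holds_union(max, c1, c2, c3, c4))
```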

One advantage of the separation of the coefficients into four tiles is that, besides allowing for introducing constraints, it also allows the use of very low spatial frequencies. As explained above, these frequencies are robust to geometric attacks, while allowing for storing a higher number of bits than a method that would consider only a global property of the frame.

In the second method, coefficients LH and HL are used for video watermark embedding. There are various ways to manipulate these coefficients in order to insert constraints. A bit is embedded by inserting a constraint between coefficients LH and HL at the lowest level of resolution. For instance, the constraint can be such that, for all x,y in a frame f, LH(x,y,f)>HL(x,y,f). As such a constraint is often too strong to be literally applied in practice, the coefficients can be manipulated such that the relationship globally applies. For instance, it can be such that:

Sum(x,y)LH(x,y,f)>Sum(x,y)HL(x,y,f).

Or

Sum(x,y)(LH(x,y,f)>HL(x,y,f))

It should be noted that the second relationship is not linear, and allows for a finer grain but more complex insertion of constraints. This allows for distributing the change to coefficients such that areas more sensitive to changes are not changed as much, if at all.
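The two global relationships can be contrasted in a short sketch. The first compares the sums of the LH and HL coefficients; the second counts the positions at which LH exceeds HL. How that count is turned into a decision is not specified here, so the majority rule used below is purely an assumption, as are the function names and the sample values.

```python
def sum_relationship(lh, hl):
    """Linear form: sum over positions of LH > sum over positions of HL."""
    return sum(lh) > sum(hl)

def count_relationship(lh, hl):
    """Non-linear form: number of positions where LH(x,y,f) > HL(x,y,f)."""
    return sum(1 for a, b in zip(lh, hl) if a > b)

def majority_holds(lh, hl):
    """Assumed decision rule: the relationship holds at most positions."""
    return count_relationship(lh, hl) > len(lh) / 2

# Flattened LH and HL coefficients of one frame (illustrative values).
lh = [5, 1, 1, 1]
hl = [0, 2, 2, 2]

print(sum_relationship(lh, hl))    # one strong coefficient dominates the sum
print(count_relationship(lh, hl))  # but only one position satisfies LH > HL
print(majority_holds(lh, hl))
```

The example shows why the count-based form is finer grained: a single strong coefficient can satisfy the sum relationship while most positions do not satisfy LH>HL.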

It should be noted that in this method, instead of modifying pixel values, a relatively small number of coefficients (64×28 LL coefficients) are modified to change the luminance of a frame. This is a great advantage for watermark embedding, especially in an application that has limited computational resources and requires a cost-effective and real-time watermarking function.

Several more methods can be imagined, depending on the sets of coefficients, which can use coefficients in one frame only or coefficients from successive frames, the measured property, the type of relationship to enforce, etc. In general, the most workable methods will use sets of coefficients with mostly invariant properties, in the sense that the ordering of property values is generally preserved after modification to the content.

For coefficient modification, the present invention in one embodiment uses two sets of coefficients C1={c11, . . . , c1N} and C2={c21, . . . , c2N}, and modifies their values. The values of coefficients cij are denoted v(cij) and v′(cij) before and after the modification, respectively.

As discussed above, more than two sets of coefficients can be used for more sophisticated relationships. It is also possible to use just one set of coefficients. Without loss of generality, it may be desirable to set the relationship that P(C1)>P(C2)+r, where r is any value that adjusts the robustness of the relationship.

If function P is for instance the maximum, then to minimize the changes only manipulate the strongest coefficient of C1 and C2 in the following way:

-   If c1i=max{c11, . . . , c1N} then v′(c1i)=v(c1i)+a1, else v′(c1i)=v(c1i)
-   If c2j=max{c21, . . . , c2N} then v′(c2j)=v(c2j)+a2, else v′(c2j)=v(c2j)
-   With a1 and a2 such that v′(c1i)>v′(c2j)+r.

The function P above is strongly non-linear, i.e., the property does not vary smoothly as a function of the coefficient values. This method is advantageous because it allows embedding of a bit by modifying only one coefficient per set (albeit the change may have to be strong).

An extension of this ‘maximum’ method that can make it more robust is to vary not only the maximum value but the N strongest values (with N typically significantly smaller than the size of the set of coefficients), to maximize the chance that the relationship is correctly decoded after manipulations to the content. It is understood that several other variations are possible to this technique.
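The basic ‘maximum’ method can be sketched as below. For simplicity the sketch assumes integer coefficient values and changes only the strongest coefficient of C1 (i.e. it takes a2=0), overshooting by one unit so that the inequality is strict; these choices and the name `enforce_max` are assumptions made for illustration.

```python
def enforce_max(c1, c2, r):
    """Return copies of C1 and C2 such that max(C1) > max(C2) + r,
    modifying only the strongest coefficient of C1 (a2 is taken as 0)."""
    out1, out2 = list(c1), list(c2)
    gap = max(out2) + r - max(out1)
    if gap >= 0:                          # relationship does not hold yet
        i = out1.index(max(out1))         # position of the strongest coefficient
        out1[i] += gap + 1                # a1: smallest integer change that works
    return out1, out2

new_c1, new_c2 = enforce_max([3, 7, 2], [9, 4], r=2)
print(new_c1, new_c2)
```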

On the other hand, if function P is a linear property of the coefficients (e.g. the average), the change can be distributed arbitrarily on all the coefficients in each set. Suppose, for example, that to set the relationship it is desirable to change the average value of coefficients such that:

avg{v′(c11), . . . , v′(c1N)}>avg{v′(c21), . . . , v′(c2N)}+r

then the change can be distributed equally on each coefficient (positively for coefficients belonging to C1 and negatively for coefficients belonging to C2), resulting in:

v′(c1i)=v(c1i)+(r+avg{v(c21), . . . , v(c2N)}−avg{v(c11), . . . , v(c1N)})/2

and similarly for c2j. If the relationship already holds, then (r+avg{v(c21), . . . , v(c2N)}−avg{v(c11), . . . , v(c1N)})<0, in which case the coefficients need not be modified.
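A sketch of the ‘average’ method follows. Adding a constant d to every coefficient of a set shifts the average of that set by d, so this sketch closes the gap between the two averages by raising every coefficient of C1 and lowering every coefficient of C2 by half of the gap each. The even split, the fact that the margin r is met exactly rather than exceeded, and the name `enforce_average` are assumptions of the sketch.

```python
from statistics import mean

def enforce_average(c1, c2, r):
    """Return copies of C1 and C2 whose averages differ by at least r,
    spreading the change equally over all coefficients of both sets."""
    gap = r + mean(c2) - mean(c1)
    if gap < 0:
        return list(c1), list(c2)         # relationship already holds
    half = gap / 2                        # each set absorbs half of the gap
    return [v + half for v in c1], [v - half for v in c2]

new_c1, new_c2 = enforce_average([10.0, 12.0], [20.0, 22.0], r=4.0)
print(new_c1, new_c2)
print(mean(new_c1) - mean(new_c2))
```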

As described above, the basic method can be extended to incorporate more relationships by using different properties. Consider, for example, the ‘maximum’ and ‘average’ methods together, to have four combinations of relationships between two sets, which allows for encoding two bits. Then, the following relationship may be enforced:

max(C1)>max(C2) and avg(C1)<avg(C2)

Also, as described above, only one set of coefficients may have to be used, in which case the relationship is set against a fixed or pre-determined value. For instance, the relationship may be enforced such that the maximum or average of C1 is higher than a certain value. In another case, a key may be used to pseudo-randomly choose to enforce either a ‘maximum’ or an ‘average’ relationship depending on the key, which significantly enhances the security of the algorithm.

The above-described approach can incorporate a masking (perceptual) model that allows for distributing the strength of the watermark in each region of the image, resulting in a minimal perceptual impact of the watermark. Such a model may also determine if a manipulation is possible in order to enforce a relationship without perceptual damage. The following describes non-limiting ways to incorporate a masking model for video content in the context of real-time watermarking in a digital cinema projector.

There are two main masking effects for images: texture masking and brightness masking. Furthermore, videos benefit from a third masking effect: temporal masking.

In some applications such as digital cinema, which has limited computational resources but requires real-time watermarking, it can be desirable to only exploit the LL, LH, HL and HH subband coefficients of the lowest resolution level, e.g., at the resolution level 5. The last three types of coefficients are potential indicators of texture while LL is an indicator of brightness. However, the corresponding resolution is low, and at this resolution the texture masking effects are not significant. To illustrate this, let us compare a video frame at full resolution and the same video frame reconstructed from coefficients at resolution level 5. See FIG. 5. It seems that most of the texture is lost at this resolution. Therefore, the LH, HL and HH subband coefficients for level 5 are poor indicators of texture, and will not be used to measure texture masking.

However, temporal masking can still be estimated with a fairly good precision, as movement is generally applied to rather large areas of the video, which are therefore of low frequency. Temporal masking can be measured by subtracting coefficients of the previous frame from coefficients of the current frame. C(f,c,l,b,x,y) denotes the coefficient of frame f, channel (i.e. color component) c, resolution level l, subband b (b=0 to 3 for coefficients LL, LH, HL and HH), position x,y. Thus, the sum of the absolute difference between coefficients of the same type on two successive frames is a valid measure of temporal change:

T(f,c,l,b,x,y)=avg(c=1 . . . 3) sum(b=0 . . . 3) abs(C(f,c,l,b,x,y)−C(f−1,c,l,b,x,y))

For a given frame f, resolution level l=5, T(f,c,l,b,x,y) is measured for all positions (x,y) and for each of the color channels (there are typically three color channels/components). If there are several channels, it can be advantageous to take the average value of T(f,c,l,b,x,y) over all channels. Then for each position (x,y), the value of T(f,c,l,b,x,y) is compared to a threshold t, and the coefficients at this position are modified only if the value is higher than t. Experimentally, a good value for t is 30. If coefficients are changed, the amount of change can be made a function of the luminance, as is known in the prior art.
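The temporal activity measure and the threshold test can be sketched for a single position (x,y). The sketch assumes that the coefficients at that position are given as a nested list indexed by channel and then by subband (LL, LH, HL, HH); the function names and the sample values are illustrative.

```python
def temporal_activity(prev, cur):
    """Average over channels of the summed absolute subband differences
    between the current and the previous frame at one position (x, y).

    prev, cur: lists indexed as [channel][subband]."""
    per_channel = [
        sum(abs(cur[c][b] - prev[c][b]) for b in range(len(cur[c])))
        for c in range(len(cur))
    ]
    return sum(per_channel) / len(per_channel)

def may_modify(prev, cur, t=30):
    """Coefficients at this position are modified only if activity exceeds t."""
    return temporal_activity(prev, cur) > t

# Three color channels, four subbands each (illustrative values).
previous = [[100, 0, 0, 0] for _ in range(3)]
moving = [[140, 5, 5, 0] for _ in range(3)]

print(temporal_activity(previous, moving))
print(may_modify(previous, moving))
print(may_modify(previous, previous))
```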

FIG. 6 is a block diagram of watermarking in a D-Cinema server (Media Block). Media Block 600 has modules, which may be implemented as hardware, software, firmware, etc., for performing watermarking including at least watermark generation and watermark embedding. Module 605 performs watermark generation including payload generation. Encoded watermark 610 is then forwarded to watermark embedding module 615, which receives the coefficients of the image from J2K decoder 625, then selects and modifies wavelet coefficients 620, and finally returns the modified coefficients to J2K decoder 625.

As described above, a watermark generation module produces the payload, which is a sequence of bits to be directly embedded. The watermark embedding module takes the payload as input, receives the wavelet coefficients of the image from a J2K decoder, selects and modifies the coefficients, and finally returns the modified coefficients to the J2K decoder. The J2K decoder continues to decode the J2K image and outputs the decompressed image. As an alternative design, the watermark generation module and/or the watermark embedding module can be integrated into the J2K decoder.

The watermark generation module can be called periodically (e.g. every 5 minutes) in order to update the timestamp in the payload. Therefore, it can be called “off-line”, i.e. a watermark payload may be generated in advance in the D-Cinema Server. In any case, its computational requirements are relatively low. However, the watermark embedding must be performed in real-time and its performance is critical.

The video watermark embedding can be done with various levels of complexity in the way the original content is taken into consideration. More complexity may mean additional robustness for a given fidelity level or more fidelity for the same robustness level. However, it comes with an additional cost in terms of the amount of computation.

Before estimating the number of required operations for video watermark embedding, it is noted that any of the following basic computational steps is considered one operation:

Bit shifting of coefficient

Addition or subtraction of two coefficients

Multiplication of two integer numbers

Comparison of two coefficients

Accessing a value in a lookup table

In the following example, C(f,c,l,b,x,y) and C′(f,c,l,b,x,y) are the original coefficient and watermarked coefficient, respectively, at position x (width), y (height) for the frequency band b (0: LL, 1: LH, 2: HL, 3: HH) at the wavelet transformation level l for color channel c for frame f. Furthermore, it is assumed that N is the number of coefficients at the lowest resolution level which need to be modified.

For the sake of simplicity, it is assumed in the following that a coefficient value is increased during video watermark embedding. However, it is noted that in the equations an addition could equally be replaced by a subtraction.

If each coefficient is changed by the same amount, then there is only one operation per coefficient:

C′(f,c,l,b,x,y)=C(f,c,l,b,x,y)+a

where the value a is a constant number. One additional comparison operation may be required to check the overflow of the modified coefficient. Thus, the total computation requirement would be 2*N.

However, the above is not an effective method. Indeed, if the constant value a is too large, the watermark will become visible. Therefore, the value a must be conservative, i.e. it must be low enough such that the watermark will never result in visible artifacts, but on the other hand, if the video watermark is too conservative, it may not survive serious attacks. The LL subband coefficient corresponds to local luminance, while LH, HL and HH coefficients correspond to image variations, or “energy”. It is well known that the human eye is less sensitive to changes in luminance in bright areas (stronger LL coefficient). It is also less sensitive to changes in areas with strong variations, which, depending on the direction of the variation, depend on coefficients LH, HL and HH. This however should be considered carefully: LH and HL coefficients may correspond to perceptually significant changes such as edges, which have to be manipulated with care.

Nevertheless, it can be advantageous to make a modification that is proportional to the coefficient, at least for coefficients LL and HH. A simple proportional modification can be done by copying the original coefficient, bit-shifting the copied coefficient, and adding or subtracting the bit-shifted coefficient, e.g.

C′(f,c,l,b,x,y)=C(f,c,l,b,x,y)+bitshift(C,n)

A typical value for n would be 7 or 8. For n=7 or 8, the coefficient is modified by 1/128 or 1/256 of its original magnitude. For example, for an image with an average luminance of 128 on a scale of 0 to 255, the impact of the coefficient modification would be a change of luminance of 1. Such a change typically does not create visible artifacts.

There are two operations per coefficient. With the possible overflow checking, the total computation requirements would be 3*N where N is the number of manipulated coefficients.

It is also noted that it is possible to impose a minimum change a, to make sure that for frames with very low luminance the watermark is sufficiently strongly embedded. In this case there are three operations per coefficient: C′(f,c,l,b,x,y)=C(f,c,l,b,x,y)+max(bitshift(C,n),a).
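The proportional modification with a minimum change can be sketched as follows. The sketch assumes non-negative integer coefficients, takes bitshift(C,n) to be a right shift by n bits, and omits overflow checking; the function name and default values are illustrative.

```python
def watermark_coefficient(c, n=7, a=1):
    """C' = C + max(bitshift(C, n), a) for a non-negative integer coefficient.

    One shift, one comparison (max) and one addition: three operations."""
    return c + max(c >> n, a)

print(watermark_coefficient(1280))       # 1280 >> 7 = 10, so the change is 10
print(watermark_coefficient(50))         # 50 >> 7 = 0, so the minimum change a applies
print(watermark_coefficient(1280, n=8))  # 1280 >> 8 = 5
```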

Additionally, the following perceptual features can be used to make adaptive changes on coefficients:

-   Temporal context. Temporal masking is related to temporal activity, which is best estimated by using coefficients in the previous, current and following frames. The present invention uses only coefficients of the preceding and current frame to measure temporal activity. A high temporal activity allows for a stronger watermark. The estimated computational complexity for temporal modelling is about four operations.
-   Texture context. For each coefficient C(f,c,l,b,x,y), K additional corresponding coefficients in other subbands may be used to model the texture and flatness, with an estimated complexity of 4K² operations.
-   Luminance context. A lookup table can be used to determine a weight according to the luminance at the coefficient C(f,c,l,b,x,y). The estimated number of operations is B, where B is the number of bits representing the luminance value.

All perceptual features can be weighted and balanced to determine themodification of the coefficient:

C(f,c,b,l,x,y)′=C(f,c,b,l,x,y)*(1+W)

where W is the weight combining all perceptual features.

The above gives rough estimates of watermark embedding complexity, where for convenience complexity is estimated in terms of the number of operations. It should be noted that the number of operations can vary according to the exact way an operation is defined, the implemented watermarking and masking procedure, etc. Nevertheless, it can be concluded that, given the relatively small number of coefficients which need to be accessed by the method of the present invention (on the order of 1/1000 of an image size), and the relatively small number of operations per coefficient, the method of the present invention is robust and computationally feasible.

Referring now to FIG. 7, watermark detection generally consists of four steps: video preparation 705, extraction and calculation of property values 710, detection of bit values 715, and decoding of embedded (watermark) information 720. A test is performed at 725 to determine if the watermark information has been successfully decoded. If the watermark information has been successfully decoded then the process is complete. If the watermark information has not been successfully decoded then the above process can be repeated.

Video preparation itself includes scaling or re-sampling of the video content, filtering, and synchronization of the video content:

-   Re-sampling of the transformed (distorted) video may have to be done if the frame rate is different at embedding and detection. This is often the case, as the frame rate for embedding is 24 frames per second, while it can be e.g. 25 (PAL/SECAM) or 29.97 (NTSC) at detection. Re-sampling is performed using linear interpolation. The output is the resampled video.
-   Filtering of the resampled video, typically with a high-pass temporal filter, to diminish the noise due to the cover content and to emphasize the watermark. The output is the filtered video.
-   Synchronization of the filtered video can be done either with the original content using a variety of methods as described above, or by cross-correlation with synchronization bits if they were embedded in the video content. Typically, only a temporal registration would have to be done, if very low spatial frequencies are used. The global synchronization unit, optionally assembled together with the local synchronization units, is used for determining the starting point of the watermark sequence. A cross-correlation is performed between the filtered video and the known synchronization bits. There is typically a strong peak in the cross-correlation function for a corresponding shift of the video.

    Referring now to FIG. 8, the local synchronization process retrieves the next local synchronization sequence/unit at 805. The video portion corresponding to the next watermark chip is retrieved at 810. The video portion and the local synchronization sequence/unit are cross-correlated at 815. A peak value of cross-correlated property value P1 is located at 820 and a peak value of property value P2 is located at 825. A test is made at 830 to determine if property value P1 is greater than property value P2 plus a pre-determined value or if property value P1 is less than property value P2 plus a pre-determined value. If the test results are negative then the video portion is rejected at 835. If the test results are positive then the video portion is retained at 840. A further test is performed at 845 to determine if the end of the video has been reached. If the end of the video has been reached then the local synchronization process is done. If the end of the video has not been reached then the local synchronization process is repeated.

    FIG. 9 shows a cross-correlation function (actually a low pass filtered version of the magnitude) with two peaks indicating the starting point of two successive watermark chips. Once the starting point of the watermark chip is located, the local synchronization units that are placed at the beginning of each payload are used for slight realignment of the video at regular intervals. In turn, each of the 12 local synchronization units is cross-correlated with the filtered video in a small window around the expected position. If a comparatively strong correlation peak is found (as measured by the difference between the highest peak and the second highest peak), the adjacent filtered video is kept for the next step; otherwise it is discarded. The rationale is that a stronger correlation peak is an indicator that the filtered video is more precisely synchronized. The output of this step is the synchronized video.
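The three preparation steps can be sketched in Python on a one-dimensional, per-frame signal. This is a minimal sketch under stated assumptions, not the described implementation: the function names, the moving-average form of the high-pass filter, the ±1 representation of the synchronization bits and the `min_gap` threshold are all hypothetical. The acceptance test follows one reading of the text, namely that a shift is kept only when the highest correlation peak exceeds the second highest by a margin.

```python
def resample_linear(signal, src_rate, dst_rate):
    """Resample a per-frame signal from src_rate to dst_rate by linear interpolation."""
    duration = (len(signal) - 1) / src_rate
    n_out = int(duration * dst_rate) + 1
    out = []
    for k in range(n_out):
        pos = k * src_rate / dst_rate          # position in source-frame units
        i = min(int(pos), len(signal) - 2)     # left neighbour, kept in range
        frac = pos - i
        out.append((1 - frac) * signal[i] + frac * signal[i + 1])
    return out


def highpass(signal, window=5):
    """Crude temporal high-pass (assumed form): subtract a centered moving average."""
    half = window // 2
    out = []
    for t in range(len(signal)):
        lo, hi = max(0, t - half), min(len(signal), t + half + 1)
        out.append(signal[t] - sum(signal[lo:hi]) / (hi - lo))
    return out


def cross_correlate(signal, sync):
    """Correlate the +1/-1 sync pattern with the signal at every valid shift."""
    m = len(sync)
    return [sum(signal[s + j] * sync[j] for j in range(m))
            for s in range(len(signal) - m + 1)]


def locate_sync(signal, sync, min_gap):
    """Return (best_shift, accepted) using the highest vs. second-highest peak."""
    corr = [abs(c) for c in cross_correlate(signal, sync)]
    best = max(range(len(corr)), key=corr.__getitem__)
    second = max((c for s, c in enumerate(corr) if s != best), default=0.0)
    return best, corr[best] - second > min_gap
```

A typical call chain would be `locate_sync(highpass(resample_linear(values, 25, 24)), sync_bits, min_gap)`, where `values` holds one property value per received frame.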

The output of the three steps of the video preparation will be denoted ‘processed video’ in the following. A processed video is a set of data, which is computed from the received video in order to facilitate extraction/calculation of the property value, which is the next step of watermark detection.

In one embodiment of watermark embedding as previously described, the average luminance of each of the four quadrants is computed for each frame. The property values form an array of size (number of frames)×4. For wavelet watermark embedding using LL subband watermarking, the property values can be extracted either from a wavelet or from a baseband representation of the received video. For both cases, a processed video of size (number of frames)×4 is obtained. In both of the above schemes the frames are separated into four parts/tiles from a central point. While this central point can be automatically set to the center point of the frame (as it is in the original video), it naturally has some offset in a camcorder captured video.
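The quadrant-based property values can be sketched as follows. The sketch assumes each frame is a list of rows of luminance values; the function names and the optional `center` argument (a hypothetical way to account for the offset in a camcorder capture) are illustrative only.

```python
def quadrant_means(frame, center=None):
    """Average luminance of the four tiles of one frame.

    frame  -- list of rows, each a list of luminance values
    center -- (row, col) split point; defaults to the frame center
    """
    h, w = len(frame), len(frame[0])
    cy, cx = center if center is not None else (h // 2, w // 2)
    tiles = [(0, cy, 0, cx), (0, cy, cx, w), (cy, h, 0, cx), (cy, h, cx, w)]
    means = []
    for r0, r1, c0, c1 in tiles:
        pixels = [frame[r][c] for r in range(r0, r1) for c in range(c0, c1)]
        means.append(sum(pixels) / len(pixels))
    return means


def property_values(video, center=None):
    """Return one row of four property values per frame."""
    return [quadrant_means(frame, center) for frame in video]
```

The tiles are returned in the order top-left, top-right, bottom-left, bottom-right, so a video of F frames yields an F×4 table.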

Extracting and computing the property values for wavelet watermark embedding using LH and HL subbands works slightly differently. Modifying LH coefficients creates stripes (stripes are equally spaced horizontal lines in the baseband video) with a frequency that can be precisely determined, at least in the watermarked video before any attack. The stripes are not visible when the watermark energy is adjusted using a masking model as described above. One can therefore compute the transformed video by measuring the energy at that frequency (e.g. using a Fourier transform). However, during a camcorder attack and subsequent cropping of the video, the relevant frequency can be shifted, and its energy spread over neighbouring frequencies. Therefore, the energy signal for all frames is collected in a 5×5 window around the relevant frequency. Each of these 25 signals is tested for a cross-correlation peak with the synchronization bit sequence, and the one with the highest peak is output as the property values.

In the watermark detection phase, property values are calculated corresponding to how the watermark is embedded. The watermark can be embedded by enforcing at least the following relationships between and/or among:

-   property values of consecutive frames;
-   one property value of a region of a frame and a pre-determined value;
-   property values of one region of a frame and another region of the same frame;
-   property values of one region of a frame and the corresponding region of the consecutive frame.

As a property value can also be the coefficient value itself, the watermark can be embedded by enforcing at least the following relationships between and/or among:

-   one coefficient value in a video volume and a pre-determined value;
-   one coefficient value in one subband of a frame and the other coefficient value at the corresponding position and subband of a consecutive frame;
-   one coefficient value in one subband of a frame and another coefficient value at another subband of the same frame.

Property values can be calculated in the baseband and/or in the transform domain. Analogous to watermark embedding, multiple bits can be detected from the multiple relationships between and/or among multiple property values.

The first step and the second step of watermark detection can be interchanged in terms of order. For convenience, it is advantageous if possible to compute the property values first because it results in data compaction (i.e., reducing the entire image data of each frame to a few values per frame), which can be fit into a form from which the watermark can be more easily read. However, it may not always be possible to perform the computation of property values first because of serious distortion of the video, especially geometric distortion.

The third step receives the property values as input, and outputs the most likely bit value for each of the 127 encoded bits. The property values may correspond to multiple insertions of each of the encoded 127 bits. In an example in accordance with the principles of the present invention, in which each bit is inserted at 12 different locations, there can be up to 12 insertions, but fewer if certain payload units have been discarded because of a bad local synchronization.

Referring now to FIG. 10, disjoint sets of coefficients are retrieved for a next encoded bit at 1005. At 1010, relevant property values are calculated for the disjoint sets of coefficients. The most likely bit value is determined from the calculated property values at 1015. A test is performed at 1020 to determine if there are any more encoded bits. If there are any more encoded bits then the above process is repeated. An exemplary accumulated signal is depicted in FIG. 11.

Each bit of the encoded payload has been expanded, encrypted and inserted at multiple locations in the content. For each of the expanded bits, as described above, insertion is typically done by setting a constraint between the property values of two sets of coefficients C1 and C2, e.g. P(C1)>P(C2). Suppose there are N such expanded bits and therefore N such inserted constraints, then:

Bit=1 if P(C1i)>P(C2i) for each i where 1≦i≦N

Bit=0 if P(C1i)<P(C2i) for each i where 1≦i≦N

In general, because of channel noise or the initial impossibility of establishing the relationship, not all of the relationships will necessarily coincide with the inserted bit. The simplest approach to solve this problem would be to take a “majority vote”, that is, to select the bit whose corresponding relationships between coefficients are observed the most often.

Bit=1 if the number of cases where P(C1i)>P(C2i) (1≦i≦N) is greater than N/2

Bit=0 otherwise
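A minimal Python sketch of this majority vote follows, using the convention stated above that P(C1i)>P(C2i) encodes a 1. The function name and the list-based inputs are assumptions for illustration.

```python
def majority_vote(p1, p2):
    """Decide one bit from N pairs of property values.

    p1[i] and p2[i] are the property values P(C1i) and P(C2i).
    Returns 1 if more than half of the pairs satisfy p1[i] > p2[i],
    and 0 otherwise (so an exact tie falls to 0).
    """
    n = len(p1)
    votes_for_one = sum(1 for a, b in zip(p1, p2) if a > b)
    return 1 if votes_for_one > n / 2 else 0
```

Note that the rule only counts how many relationships hold; it ignores by how much each one holds.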

This approach does not help to resolve cases where N is even, and the number of relationships for bit=1 and bit=0 are equal. Furthermore, this approach does not take full advantage of the information of P(C1), P(C2), and possibly other information that may increase the likelihood of correctly determining the relationship. A more refined approach consists of estimating a probability that the inserted bit value is 1, respectively 0, given the observation of property values P(C1i) and P(C2i). The individually estimated probabilities are combined using a probabilistic approach, and a decision is made based on the Maximum-Likelihood (ML) criterion, where the most probable bit is selected. Other criteria are possible, such as the Neyman-Pearson rule.

Using the ML rule, where the most probable bit is selected, the decision is based only on the property values. Then the ML rule states:

If: Prob(Bit=1; P(C11),P(C21), . . . , P(C1N),P(C2N))>

Prob(Bit=0; P(C11),P(C21), . . . , P(C1N),P(C2N)) Then bit=1

Using Bayes' rule, and assuming that each bit value is equi-probable, this can be rewritten as:

Prob(P(C11),P(C21), . . . , P(C1N),P(C2N);bit=1)>

Prob(P(C11),P(C21), . . . , P(C1N),P(C2N);bit=0)

As the bit is expanded at different pseudo-random locations in the content, it can be assumed that the property values are relatively independent, so that each joint probability factors into a product over the N insertions. That is,

Product i=1, . . . , N (Prob(P(C1i),P(C2i);bit=1)/Prob(P(C1i),P(C2i);bit=0))>1

Taking the logarithm:

Sum i=1, . . . , N (log(Prob(P(C1i),P(C2i);bit=1))−log(Prob(P(C1i),P(C2i);bit=0)))>0

To implement this equation, the functions Prob(P(C1i),P(C2i);bit=1) and Prob(P(C1i),P(C2i);bit=0) need to be derived. These functions will depend on the properties of the channel. The general technique consists of collecting enough data for estimating these functions. Some a priori knowledge, or assumptions on the probability model (e.g. that the coefficients or the noise follow a Gaussian distribution), can be used.
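The log-likelihood rule can be sketched in Python with the two likelihood functions passed in as parameters. The Gaussian model on the difference P(C1i)−P(C2i) used in the example is purely an assumption chosen for illustration; in practice the functions would be estimated from collected data as described above. All names are hypothetical.

```python
import math


def gaussian_loglik(mean, sigma):
    """Assumed model: the difference P(C1i) - P(C2i) is Gaussian(mean, sigma)."""
    def loglik(p1, p2):
        d = p1 - p2
        return (-0.5 * ((d - mean) / sigma) ** 2
                - math.log(sigma * math.sqrt(2 * math.pi)))
    return loglik


def ml_bit(pairs, loglik_one, loglik_zero):
    """Maximum-likelihood decision for one bit.

    pairs       -- list of (P(C1i), P(C2i)) observations
    loglik_one  -- log Prob(P(C1i), P(C2i); bit=1)
    loglik_zero -- log Prob(P(C1i), P(C2i); bit=0)
    """
    llr = sum(loglik_one(p1, p2) - loglik_zero(p1, p2) for p1, p2 in pairs)
    return 1 if llr > 0 else 0
```

For example, with `gaussian_loglik(+1.0, 2.0)` for bit 1 and `gaussian_loglik(-1.0, 2.0)` for bit 0, the observations `[(1.0, 1.1), (2.0, 2.1), (8.0, 3.0)]` yield bit 1, whereas a majority vote over the same three pairs would yield bit 0, because one strong observation outweighs two marginal ones.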

Consider the very specific case where the logarithm of the probability is proportional to the difference between P(C1i) and P(C2i), symmetrically for bit 1 and bit 0:

Log(a1*Prob(P(C1i),P(C2i);bit=1))=a2*(P(C1i)−P(C2i))

Log(a1*Prob(P(C1i),P(C2i);bit=0))=−a2*(P(C1i)−P(C2i))

Then the rule becomes:

Sum i=1, . . . , N 2*a2*(P(C1i)−P(C2i))>0

Or, for a2>0,

Sum i=1, . . . , N P(C1i)>Sum i=1, . . . , N P(C2i)

The rule derived for this specific case corresponds to a simple correlation, similar to what is used in spread spectrum systems. This rule is, however, suboptimal because in general the logarithm of the probability is not proportional to the difference. This is one reason why the method of the present invention can be seen as more general, and more effective, than spread spectrum based methods.

In fact, because of the specific way in which constraints are inserted, i.e. depending on the original content values, it turns out that the probability is generally not a monotonically increasing function. To illustrate that, the following simulation was performed in which the estimate of a bit value was compared based on the observation of a received signal, for respectively the relationship-based approach of the present invention and a classic spread spectrum approach.

The original content X was generated as Gaussian noise. A binary watermark W, taking its value in {−1,+1}, was added to this signal. The binary watermark was added first following the constraint-based concept in the following way:

If X>a1, Y1=X

If X<a2, Y1=X

Else Y1=X+r*W

The values a1=0.5, a2=−0.5, r=0.3 were chosen. This resulted in a PSNR of −15 dB.

Then a spread-spectrum watermark was added to the generated signal in the following way:

Y2=X+a*W

The parameter ‘a’ was adjusted to result in the same PSNR of −15 dB.

The same noise vector N was added to the two signals Y1 and Y2, to get two received signals R1=Y1+N and R2=Y2+N. The noise also had a PSNR of −10 dB with respect to the original content. For the two received contents R1 and R2, the probability that the embedded bit was ‘1’ given the received signal value was estimated. The results are plotted in the graph depicted in FIG. 12. The difference is striking: as expected, for the spread-spectrum embedding, the estimated probability that the bit is 1 increases linearly with the received signal value. However, for the relationship-based approach of the present invention, the estimated probability has a very specific shape going through a minimum then a maximum. This shape can be explained as follows:

-   When the cover content has a high or low value, it is most likely not used for embedding; therefore it is logical that the received signal is uncorrelated to the bit.
-   The estimate is most reliable at −0.5 and +0.5, which are the minimum/maximum values at which the watermark is embedded.

It can, therefore, be concluded that the correct estimate of the probability is of significant importance to the proper working of the method of the present invention.
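A simulation in the spirit of the one described can be sketched in Python as follows. It is not a reproduction of FIG. 12: the unit-variance cover signal, the noise standard deviation `noise_std`, the choice of setting the spread-spectrum strength so that both watermarks have equal mean power, the histogram binning, and all names are assumptions made for this example.

```python
import random


def simulate(n=200_000, a1=0.5, a2=-0.5, r=0.3, noise_std=0.3, seed=0):
    """Return (w, r1, r2): watermark values and the two received signals."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]            # cover content (assumed unit variance)
    w = [rng.choice((-1, 1)) for _ in range(n)]            # binary watermark
    noise = [rng.gauss(0.0, noise_std) for _ in range(n)]  # same noise for both signals
    used = [a2 <= xi <= a1 for xi in x]                    # constraint-based: embed only here
    a = r * (sum(used) / n) ** 0.5                         # equal mean watermark power (assumption)
    r1 = [xi + (r * wi if u else 0.0) + ni
          for xi, wi, u, ni in zip(x, w, used, noise)]
    r2 = [xi + a * wi + ni for xi, wi, ni in zip(x, w, noise)]
    return w, r1, r2


def prob_bit_one(received, w, lo=-2.0, hi=2.0, bins=16):
    """Estimate Prob(W=+1 | received value) per histogram bin."""
    ones, total = [0] * bins, [0] * bins
    width = (hi - lo) / bins
    for v, wi in zip(received, w):
        if lo <= v < hi:
            b = min(int((v - lo) / width), bins - 1)
            total[b] += 1
            ones[b] += wi == 1
    return [o / t if t else None for o, t in zip(ones, total)]
```

Calling `prob_bit_one` on `r1` and on `r2` gives two curves over the received value. Under these assumptions the spread-spectrum curve rises steadily, while the constraint-based curve dips below 0.5 for moderately negative values, rises above 0.5 for moderately positive values, and returns towards 0.5 for large magnitudes where the content was not used for embedding.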

In the last step, once the 127 bit values of the encoded payload are estimated, the 64 bit payload can be decoded using the BCH decoder. With such a code, up to 10 errors can be corrected in the estimated encoded payload. As described above, this payload contains various information for forensic tracking such as the location/projector identifier and timestamp in a digital cinema application. This information is extracted from the decoded payload and allows for a wide range of uses, such as forensically tracking down potential fraud that has occurred.

In case of a failure in the last step (i.e. no valid watermark information is decoded), the above four steps can be repeated with a different strategy for each step (e.g. optimized synchronization and registration for the video in the first step) until watermark information is successfully decoded or a maximum number of such trials is reached.
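The retry logic around the four detection steps can be sketched as a simple control loop. The four step functions are passed in as callables because their implementation is outside the scope of this sketch; the function name, the `strategies` list and the convention that `decode` returns None on failure are assumptions for illustration.

```python
def detect_with_retries(video, strategies, prepare, extract, detect_bits, decode,
                        max_trials=3):
    """Run the four detection steps once per strategy until decoding succeeds.

    Returns the decoded payload, or None if every trial fails.
    """
    for strategy in strategies[:max_trials]:
        processed = prepare(video, strategy)      # step 1: video preparation
        props = extract(processed, strategy)      # step 2: property values
        bits = detect_bits(props, strategy)       # step 3: bit values
        payload = decode(bits)                    # step 4: payload decoding
        if payload is not None:
            return payload
    return None
```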

It is to be understood that the present invention may be implemented in various forms of hardware (e.g. ASIC chip), software, firmware, special purpose processors, or a combination thereof, for example, within a server or mobile device. Preferably, the present invention is implemented as a combination of hardware and software. Moreover, the software is preferably implemented as an application program tangibly embodied on a program storage device. The application program may be uploaded to, and executed by, a machine comprising any suitable architecture. Preferably, the machine is implemented on a computer platform having hardware such as one or more central processing units (CPU), a random access memory (RAM), and input/output (I/O) interface(s). The computer platform also includes an operating system and microinstruction code. The various processes and functions described herein may either be part of the microinstruction code or part of the application program (or a combination thereof), which is executed via the operating system. In addition, various other peripheral devices may be connected to the computer platform such as an additional data storage device and a printing device.

It is to be further understood that, because some of the constituent system components and method steps depicted in the accompanying figures are preferably implemented in software, the actual connections between the system components (or the process steps) may differ depending upon the manner in which the present invention is programmed. Given the teachings herein, one of ordinary skill in the related art will be able to contemplate these and similar implementations or configurations of the present invention.

1-34. (canceled)
35. A method for detecting watermarks in video images, said method comprising decoding a payload, wherein said payload comprises a bit sequence generated and embedded by enforcing relationships between said property values within a volume of video.
36. The method according to claim 35, wherein said payload is decoded from an estimated encoded payload.
37. The method according to claim 35, further comprising extracting and calculating property values and wherein said property values are calculated from one of a spatial domain and a transform domain of said volume of video.
38. The method according to claim 37, further comprising detecting bit values and wherein at least one pre-determined value and one of said property values are used for detecting one bit of said payload.
39. The method according to claim 38, wherein at least two of said property values are used for detecting one bit of said payload.
40. The method according to claim 38, wherein at least one bit of said payload is detected from at least one relationship between or among said property values.
41. The method according to claim 37, further comprising calculating a first property value from a first region of a first frame of said volume of video.
42. The method according to claim 41, further comprising calculating a second property value from a first region of a second frame of said volume of video.
43. The method according to claim 41, further comprising calculating a second property value from a second region of said first frame of said volume of video.
44. The method according to claim 42, wherein said second frame is a consecutive frame of said first frame, and said first property value and said second property value are calculated in a same manner from said first and said second frames, respectively.
45. The method according to claim 43, wherein said first property value is calculated from a top region of said first frame, and said second property value is calculated from a bottom region of said first frame.
46. The method according to claim 42, wherein said first property value is calculated from a top region of said first frame, and said second property value is calculated from a top region of said second frame.
47. The method according to claim 43, wherein said first frame of said volume of video is divided into four tiles from a center point of said first frame.
48. The method according to claim 47, wherein said first property value is calculated from a first one of said four tiles and said second property value is calculated from a second one of said four tiles.
49. The method according to claim 42, wherein said first frame of said volume of video is divided into four tiles from a center point of said first frame and said second frame of said volume of video is correspondingly divided into four tiles from a center point of said second frame.
50. The method according to claim 49, wherein said first property value is calculated from one of said four tiles of said first frame, and said second property value is calculated from a corresponding one of said four tiles of said second frame.
51. The method according to claim 35, further comprising preparing a signal, said signal including a watermark and wherein said preparing step further comprises: re-sampling said signal when an encoding frame rate is different than a detecting frame rate; filtering said signal; and synchronizing said signal.
52. The method according to claim 51, wherein said re-sampling step is performed using linear interpolation.
53. The method according to claim 52, wherein said filtering step is performed using a high-pass filter in order to reduce noise and emphasize the watermark signal.
54. The method according to claim 52, wherein said synchronizing step is used to determine the starting point of the watermark.
55. The method according to claim 54, wherein said synchronizing step is performed using an original video content.
56. The method according to claim 54, wherein said synchronizing step is performed by cross-correlation with synchronization bits.
57. The method according to claim 52, wherein said synchronizing step further comprises: synchronizing said signal globally; and synchronizing said signal locally.
58. The method according to claim 38, wherein said detecting step further comprises accumulating a payload signal.
59. The method according to claim 58, wherein said accumulating step further comprises reading an encrypted payload signal with a key to retrieve an encoded signal.
60. The method according to claim 59, wherein said encrypted payload signal is a replicated payload signal.
61. The method according to claim 59, further comprising selecting a value for a bit in a bit sequence representing said watermark whose corresponding relationship between said property values occurs most frequently.
62. The method according to claim 59, further comprising selecting a most probable value for a bit in a bit sequence representing said watermark based on a maximum-likelihood criteria.
63. The method according to claim 62, wherein said most probable value for a bit is based on combined individual estimated probabilities.
64. A system for detecting watermarks in video images, comprising means for decoding a payload, wherein said payload comprises a bit sequence generated and embedded by enforcing relationships between property values within a volume of video.
65. The system according to claim 64, further comprising means for extracting and calculating property values.
66. The system according to claim 65 further comprising means for detecting bit values.
67. The system according to claim 64 further comprising means for preparing a signal, wherein said signal includes a watermark.